The innate desire to predict the future stems from the many advantages accurate prediction delivers. Depending on the quality and processing of the data, the goal can be as easy as predicting a full moon three years from today or as hard as predicting the weather a week from now. The Machine Learning (ML) techniques described here can reveal whether, and how well, a dataset can be forecast at all. This project walks through finding and analyzing the patterns that time-series data can hold and the algorithms that discover them.
The purpose of this project is to forecast a dataset 140 days into the future. Using the R programming language and many of its packages, we methodically worked through understanding the data structure and the problems it held. The dataset contains six different groups of stocks. Each group has five variable columns, each representing a different component of a stock price series: High of the day, Volume, Low of the day, Open price, and Close price. There are 1,622 rows per group, covering the period from May 2011 to October 2017.
To begin building a model to forecast future prices, we first need a solid foundation to work from. Any data that is missing, anomalous, or incorrectly formatted must be addressed before it can be used reliably. The process of gathering the data, cleaning it, and correcting any errors is described below.
Values to be forecast for each series:
- S01 – Forecast Var01, Var02
- S02 – Forecast Var02, Var03
- S03 – Forecast Var05, Var07
- S04 – Forecast Var01, Var02
- S05 – Forecast Var02, Var03
- S06 – Forecast Var05, Var07
Loading all the required libraries.
library(utils)
library(readxl)
library(xlsx)
library(tidyr)
library(dplyr)
library(ggplot2)
library(forecast)
library(fma)
library(fpp2)
library(tseries)
library(gridExtra)
library(ggcorrplot)
library(astsa)
library(janitor)
library(timeDate)
library(scales)
The dates imported from Excel are in a numeric (serial) format. We convert them with as.Date, which simply requires setting the origin date; for Excel on Windows, the origin is December 30, 1899 (valid for dates after 1900).
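As a quick sanity check of that origin, the first serial value that appears in the data (40669) should convert to the first trading date:

```r
# Excel serial dates count days from the origin; 1899-12-30 is correct for
# Excel on Windows, so serial 40669 should map to the first date in the file
as.Date(40669, origin = "1899-12-30")  # "2011-05-06"
```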
project_in_df <- data.frame(read_excel("Project1data.xls", sheet = "Set for Class"))
project_in_df = mutate(project_in_df, datetime=as.Date(SeriesInd, origin="1899-12-30"))
project_in_df = project_in_df[c(1, 8, 2, 3, 4, 5, 6, 7)]
head(project_in_df)
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 1 40669 2011-05-06 S03 30.64286 123432400 30.34 30.49 30.57286
## 2 40669 2011-05-06 S02 10.28000 60855800 10.05 10.17 10.28000
## 3 40669 2011-05-06 S01 26.61000 10369300 25.89 26.20 26.01000
## 4 40669 2011-05-06 S06 27.48000 39335700 26.82 27.02 27.32000
## 5 40669 2011-05-06 S05 69.26000 27809100 68.19 68.72 69.15000
## 6 40669 2011-05-06 S04 17.20000 16587400 16.88 16.94 17.10000
nrow(project_in_df)
## [1] 10572
# Create separate dataframes for each group
group_S01_df = filter(project_in_df, group == 'S01')
group_S02_df = filter(project_in_df, group == 'S02')
group_S03_df = filter(project_in_df, group == 'S03')
group_S04_df = filter(project_in_df, group == 'S04')
group_S05_df = filter(project_in_df, group == 'S05')
group_S06_df = filter(project_in_df, group == 'S06')
Looking at the data, we found that it is financial stock-market data from six different companies. The variables represent the following:
- Var01 – High of the day
- Var02 – Volume
- Var03 – Low of the day
- Var05 – Open for the day
- Var07 – Close for the day
ggplot(group_S01_df, aes(datetime, Var07)) + geom_line(colour="#000099") + scale_x_date(labels = date_format("%b-%Y")) + xlab("datetime") + ylab("Var07")
## Warning: Removed 140 rows containing missing values (geom_path).
ggplot(group_S02_df, aes(datetime, Var07)) + geom_line(colour="#990000") + scale_x_date(labels = date_format("%b-%Y")) + xlab("datetime") + ylab("Var07")
## Warning: Removed 140 rows containing missing values (geom_path).
ggplot(group_S03_df, aes(datetime, Var07)) + geom_line(colour="#009900") + scale_x_date(labels = date_format("%b-%Y")) + xlab("datetime") + ylab("Var07")
## Warning: Removed 140 rows containing missing values (geom_path).
ggplot(group_S04_df, aes(datetime, Var07)) + geom_line(colour="#990099") + scale_x_date(labels = date_format("%b-%Y")) + xlab("datetime") + ylab("Var07")
## Warning: Removed 140 rows containing missing values (geom_path).
ggplot(group_S05_df, aes(datetime, Var07)) + geom_line(colour="#999900") + scale_x_date(labels = date_format("%b-%Y")) + xlab("datetime") + ylab("Var07")
## Warning: Removed 140 rows containing missing values (geom_path).
ggplot(group_S06_df, aes(datetime, Var07)) + geom_line(colour="#009999") + scale_x_date(labels = date_format("%b-%Y")) + xlab("datetime") + ylab("Var07")
## Warning: Removed 140 rows containing missing values (geom_path).
#remove forecast cells for now
#project_in_df <- project_in_df[1:9732,]
group_S01_df <- slice(group_S01_df, 1:1622)
group_S02_df <- slice(group_S02_df, 1:1622)
group_S03_df <- slice(group_S03_df, 1:1622)
group_S04_df <- slice(group_S04_df, 1:1622)
group_S05_df <- slice(group_S05_df, 1:1622)
group_S06_df <- slice(group_S06_df, 1:1622)
#Change numeric sequence to actual dates for graphing
#note, we put back to sequence before writing the files back out to excel
group_S01_df[,1] <- excel_numeric_to_date(group_S01_df[,1])
group_S02_df[,1] <- excel_numeric_to_date(group_S02_df[,1])
group_S03_df[,1] <- excel_numeric_to_date(group_S03_df[,1])
group_S04_df[,1] <- excel_numeric_to_date(group_S04_df[,1])
group_S05_df[,1] <- excel_numeric_to_date(group_S05_df[,1])
group_S06_df[,1] <- excel_numeric_to_date(group_S06_df[,1])
#check for rows with NA's
group_S01_df[rowSums(is.na(group_S01_df))>0,]
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 1537 2017-06-11 2017-06-11 S01 NA 7329600 NA NA NA
## 1538 2017-06-12 2017-06-12 S01 NA 6121400 NA NA NA
## 1607 2017-09-19 2017-09-19 S01 58.83 6337000 NA NA NA
## 1608 2017-09-22 2017-09-22 S01 59.28 3690900 NA NA NA
group_S02_df[rowSums(is.na(group_S02_df))>0,]
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 1537 2017-06-11 2017-06-11 S02 NA 38160300 NA NA NA
## 1538 2017-06-12 2017-06-12 S02 NA 45801300 NA NA NA
## 1607 2017-09-19 2017-09-19 S02 13.26 19465000 NA NA NA
## 1608 2017-09-22 2017-09-22 S02 13.20 16234300 NA NA NA
group_S03_df[rowSums(is.na(group_S03_df))>0,]
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 1537 2017-06-11 2017-06-11 S03 NA 42343600 NA NA NA
## 1538 2017-06-12 2017-06-12 S03 NA 50074700 NA NA NA
## 1607 2017-09-19 2017-09-19 S03 95.43 32026000 NA NA NA
## 1608 2017-09-22 2017-09-22 S03 97.19 38018600 NA NA NA
group_S04_df[rowSums(is.na(group_S04_df))>0,]
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 1537 2017-06-11 2017-06-11 S04 NA 9098800 NA NA NA
## 1538 2017-06-12 2017-06-12 S04 NA 11188200 NA NA NA
## 1607 2017-09-19 2017-09-19 S04 36.72 34330700 NA NA NA
## 1608 2017-09-22 2017-09-22 S04 36.95 7785800 NA NA NA
group_S05_df[rowSums(is.na(group_S05_df))>0,]
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 795 2014-07-01 2014-07-01 S05 NA NA NA NA NA
## 1537 2017-06-11 2017-06-11 S05 NA 16610900 NA NA NA
## 1538 2017-06-12 2017-06-12 S05 NA 19331600 NA NA NA
## 1607 2017-09-19 2017-09-19 S05 90.4 13191900 NA NA NA
## 1608 2017-09-22 2017-09-22 S05 89.9 11766100 NA NA NA
group_S06_df[rowSums(is.na(group_S06_df))>0,]
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 20 2011-06-03 2011-06-03 S06 NA NA NA NA NA
## 1537 2017-06-11 2017-06-11 S06 NA 19885500 NA NA NA
## 1538 2017-06-12 2017-06-12 S06 NA 32570900 NA NA NA
## 1607 2017-09-19 2017-09-19 S06 49.21 13222800 NA NA NA
## 1608 2017-09-22 2017-09-22 S06 48.88 10644000 NA NA NA
From the above results we can see missing values in all the variables, so we need to impute the NAs. Var01 has 14, Var02 has 2, Var03 has 26, Var05 has 26, and Var07 has 26 missing values.
Here we impute each missing value with the next value in the sequence, e.g., if row 1600 is NA for Var01, we take the value from row 1601. Since these are daily stock values and only a few are missing, the next day's value should be close enough.
# Impute NAs with the next observation; the while loop repeats so that runs of
# consecutive NAs are eventually filled. Note that transmute() keeps only the
# listed columns, so datetime and group are dropped here.
removeNAs <- function(dfTs)
{
while(nrow(dfTs[rowSums(is.na(dfTs))>0,]) > 0)
{
dfTs <- transmute(dfTs,
SeriesInd = SeriesInd,
Var01 = if_else(is.na(Var01), lead(Var01), Var01),
Var02 = if_else(is.na(Var02), lead(Var02), Var02),
Var03 = if_else(is.na(Var03), lead(Var03), Var03),
Var05 = if_else(is.na(Var05), lead(Var05), Var05),
Var07 = if_else(is.na(Var07), lead(Var07), Var07))
}
# Verify: this should print zero rows
print(dfTs[rowSums(is.na(dfTs))>0,])
return(dfTs)
}
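For reference, the same next-observation imputation can be sketched in a few lines of base R (the `nocb` name and the sample values are illustrative, not part of the project code):

```r
# Fill each NA with the next available value; walking the NA positions in
# reverse order ensures runs of consecutive NAs are filled correctly
nocb <- function(x) {
  for (i in rev(which(is.na(x)))) {
    if (i < length(x)) x[i] <- x[i + 1]
  }
  x
}

nocb(c(26.61, NA, NA, 25.84))  # both NAs take the following value, 25.84
```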
group_S01_df <- removeNAs(group_S01_df)
## [1] SeriesInd Var01 Var02 Var03 Var05 Var07
## <0 rows> (or 0-length row.names)
group_S02_df <- removeNAs(group_S02_df)
## [1] SeriesInd Var01 Var02 Var03 Var05 Var07
## <0 rows> (or 0-length row.names)
group_S03_df <- removeNAs(group_S03_df)
## [1] SeriesInd Var01 Var02 Var03 Var05 Var07
## <0 rows> (or 0-length row.names)
group_S04_df <- removeNAs(group_S04_df)
## [1] SeriesInd Var01 Var02 Var03 Var05 Var07
## <0 rows> (or 0-length row.names)
group_S05_df <- removeNAs(group_S05_df)
## [1] SeriesInd Var01 Var02 Var03 Var05 Var07
## <0 rows> (or 0-length row.names)
group_S06_df <- removeNAs(group_S06_df)
## [1] SeriesInd Var01 Var02 Var03 Var05 Var07
## <0 rows> (or 0-length row.names)
summary(group_S01_df)
## SeriesInd Var01 Var02 Var03 Var05 Var07
## Min. :2011-05-06 Min. :23.01 Min. : 1339900 Min. :22.28 Min. :22.55 Min. :22.50
## 1st Qu.:2012-12-10 1st Qu.:29.85 1st Qu.: 5347550 1st Qu.:29.34 1st Qu.:29.60 1st Qu.:29.61
## Median :2014-07-25 Median :35.66 Median : 7895050 Median :35.10 Median :35.36 Median :35.41
## Mean :2014-07-23 Mean :39.43 Mean : 8907092 Mean :38.69 Mean :39.05 Mean :39.09
## 3rd Qu.:2016-02-29 3rd Qu.:48.74 3rd Qu.:11321675 3rd Qu.:47.91 3rd Qu.:48.31 3rd Qu.:48.30
## Max. :2017-10-13 Max. :62.31 Max. :48477500 Max. :61.59 Max. :62.29 Max. :62.14
summary(group_S02_df)
## SeriesInd Var01 Var02 Var03 Var05 Var07
## Min. :2011-05-06 Min. : 9.03 Min. : 7128800 Min. : 8.82 Min. : 8.99 Min. : 8.92
## 1st Qu.:2012-12-10 1st Qu.:12.18 1st Qu.: 27880300 1st Qu.:11.81 1st Qu.:12.01 1st Qu.:12.00
## Median :2014-07-25 Median :14.07 Median : 39767500 Median :13.76 Median :13.91 Median :13.91
## Mean :2014-07-23 Mean :13.99 Mean : 50633098 Mean :13.68 Mean :13.85 Mean :13.84
## 3rd Qu.:2016-02-29 3rd Qu.:15.84 3rd Qu.: 59050900 3rd Qu.:15.52 3rd Qu.:15.72 3rd Qu.:15.67
## Max. :2017-10-13 Max. :39.36 Max. :480879500 Max. :38.28 Max. :39.33 Max. :38.40
summary(group_S03_df)
## SeriesInd Var01 Var02 Var03 Var05 Var07
## Min. :2011-05-06 Min. : 28.00 Min. : 13046400 Min. : 27.18 Min. : 27.48 Min. : 27.44
## 1st Qu.:2012-12-10 1st Qu.: 54.12 1st Qu.: 55186050 1st Qu.: 52.89 1st Qu.: 53.34 1st Qu.: 53.53
## Median :2014-07-25 Median : 76.24 Median : 85595400 Median : 74.92 Median : 75.66 Median : 75.76
## Mean :2014-07-23 Mean : 77.65 Mean : 99387244 Mean : 76.15 Mean : 76.95 Mean : 76.91
## 3rd Qu.:2016-02-29 3rd Qu.: 99.52 3rd Qu.:127036525 3rd Qu.: 97.57 3rd Qu.: 98.53 3rd Qu.: 98.51
## Max. :2017-10-13 Max. :134.54 Max. :470249500 Max. :131.40 Max. :134.46 Max. :133.00
summary(group_S04_df)
## SeriesInd Var01 Var02 Var03 Var05 Var07
## Min. :2011-05-06 Min. :11.80 Min. : 3468900 Min. :11.09 Min. :11.30 Min. :11.09
## 1st Qu.:2012-12-10 1st Qu.:15.99 1st Qu.: 12918725 1st Qu.:15.66 1st Qu.:15.83 1st Qu.:15.80
## Median :2014-07-25 Median :23.45 Median : 17000800 Median :22.77 Median :23.15 Median :23.28
## Mean :2014-07-23 Mean :26.46 Mean : 20757818 Mean :25.83 Mean :26.15 Mean :26.14
## 3rd Qu.:2016-02-29 3rd Qu.:36.42 3rd Qu.: 24015700 3rd Qu.:35.59 3rd Qu.:35.98 3rd Qu.:35.93
## Max. :2017-10-13 Max. :52.62 Max. :233872100 Max. :51.64 Max. :52.28 Max. :52.37
summary(group_S05_df)
## SeriesInd Var01 Var02 Var03 Var05 Var07
## Min. :2011-05-06 Min. : 56.99 Min. : 4156600 Min. : 55.94 Min. : 56.85 Min. : 56.57
## 1st Qu.:2012-12-10 1st Qu.: 78.86 1st Qu.: 11199775 1st Qu.: 77.25 1st Qu.: 77.95 1st Qu.: 78.11
## Median :2014-07-25 Median : 85.95 Median : 14575700 Median : 84.88 Median : 85.33 Median : 85.44
## Mean :2014-07-23 Mean : 84.21 Mean : 16787776 Mean : 82.97 Mean : 83.59 Mean : 83.63
## 3rd Qu.:2016-02-29 3rd Qu.: 90.99 3rd Qu.: 20014325 3rd Qu.: 89.71 3rd Qu.: 90.37 3rd Qu.: 90.41
## Max. :2017-10-13 Max. :104.76 Max. :118023500 Max. :103.95 Max. :104.42 Max. :104.38
summary(group_S06_df)
## SeriesInd Var01 Var02 Var03 Var05 Var07
## Min. :2011-05-06 Min. : 23.41 Min. : 4297000 Min. : 22.58 Min. : 22.91 Min. : 22.88
## 1st Qu.:2012-12-10 1st Qu.: 30.56 1st Qu.: 15427725 1st Qu.: 29.99 1st Qu.: 30.32 1st Qu.: 30.26
## Median :2014-07-25 Median : 37.14 Median : 21795200 Median : 36.67 Median : 36.94 Median : 36.98
## Mean :2014-07-23 Mean : 40.21 Mean : 25733544 Mean : 39.51 Mean : 39.86 Mean : 39.87
## 3rd Qu.:2016-02-29 3rd Qu.: 50.74 3rd Qu.: 31506875 3rd Qu.: 50.06 3rd Qu.: 50.45 3rd Qu.: 50.43
## Max. :2017-10-13 Max. :195.18 Max. :144985700 Max. :189.36 Max. :195.00 Max. :189.72
From the above data summaries we can see that all the missing values have been replaced with appropriate data.
# Select relevant columns for each group
group_S01_df = select (group_S01_df, matches("SeriesInd|datetime|group|Var01|Var02"))
group_S02_df = select (group_S02_df, matches("SeriesInd|datetime|group|Var02|Var03"))
group_S03_df = select (group_S03_df, matches("SeriesInd|datetime|group|Var05|Var07"))
group_S04_df = select (group_S04_df, matches("SeriesInd|datetime|group|Var01|Var02"))
group_S05_df = select (group_S05_df, matches("SeriesInd|datetime|group|Var02|Var03"))
group_S06_df = select (group_S06_df, matches("SeriesInd|datetime|group|Var05|Var07"))
# Check number of rows
print (c(nrow(group_S01_df), nrow(group_S02_df), nrow(group_S03_df), nrow(group_S04_df), nrow(group_S05_df), nrow(group_S06_df)))
## [1] 1622 1622 1622 1622 1622 1622
# Verify dataframes
head(project_in_df, 10)
## SeriesInd datetime group Var01 Var02 Var03 Var05 Var07
## 1 40669 2011-05-06 S03 30.64286 123432400 30.34000 30.49000 30.57286
## 2 40669 2011-05-06 S02 10.28000 60855800 10.05000 10.17000 10.28000
## 3 40669 2011-05-06 S01 26.61000 10369300 25.89000 26.20000 26.01000
## 4 40669 2011-05-06 S06 27.48000 39335700 26.82000 27.02000 27.32000
## 5 40669 2011-05-06 S05 69.26000 27809100 68.19000 68.72000 69.15000
## 6 40669 2011-05-06 S04 17.20000 16587400 16.88000 16.94000 17.10000
## 7 40670 2011-05-07 S03 30.79857 150476200 30.46428 30.65714 30.62571
## 8 40670 2011-05-07 S02 11.24000 215620200 10.40000 10.45000 10.96000
## 9 40670 2011-05-07 S01 26.30000 10943800 25.70000 25.95000 25.86000
## 10 40670 2011-05-07 S06 28.24000 55416000 27.24000 27.27000 28.07000
head(group_S01_df, 10)
## SeriesInd Var01 Var02
## 1 2011-05-06 26.61 10369300
## 2 2011-05-07 26.30 10943800
## 3 2011-05-08 26.03 8933800
## 4 2011-05-09 25.84 10775400
## 5 2011-05-10 26.34 12875600
## 6 2011-05-13 26.49 11677000
## 7 2011-05-14 26.03 21165300
## 8 2011-05-15 25.16 18809200
## 9 2011-05-16 25.00 22908400
## 10 2011-05-17 24.77 20359100
head(group_S02_df, 10)
## SeriesInd Var02 Var03
## 1 2011-05-06 60855800 10.05
## 2 2011-05-07 215620200 10.40
## 3 2011-05-08 200070600 11.13
## 4 2011-05-09 130201700 11.32
## 5 2011-05-10 130463000 11.46
## 6 2011-05-13 170626200 11.78
## 7 2011-05-14 162995900 11.72
## 8 2011-05-15 154527100 11.47
## 9 2011-05-16 116531200 11.51
## 10 2011-05-17 96149800 11.55
head(group_S03_df, 10)
## SeriesInd Var05 Var07
## 1 2011-05-06 30.49000 30.57286
## 2 2011-05-07 30.65714 30.62571
## 3 2011-05-08 30.62571 30.13857
## 4 2011-05-09 30.25000 30.08286
## 5 2011-05-10 30.04286 30.28286
## 6 2011-05-13 30.40000 30.01571
## 7 2011-05-14 29.88428 29.67429
## 8 2011-05-15 29.69571 30.09286
## 9 2011-05-16 30.01571 29.91857
## 10 2011-05-17 30.13286 29.41857
head(group_S04_df, 10)
## SeriesInd Var01 Var02
## 1 2011-05-06 17.20 16587400
## 2 2011-05-07 17.23 11718100
## 3 2011-05-08 17.30 16422000
## 4 2011-05-09 16.90 31816300
## 5 2011-05-10 16.76 15470000
## 6 2011-05-13 16.83 16181900
## 7 2011-05-14 16.86 15672400
## 8 2011-05-15 16.98 16955600
## 9 2011-05-16 17.23 16715600
## 10 2011-05-17 17.25 18415000
head(group_S05_df, 10)
## SeriesInd Var02 Var03
## 1 2011-05-06 27809100 68.19
## 2 2011-05-07 30174700 68.80
## 3 2011-05-08 35044700 69.34
## 4 2011-05-09 27192100 69.42
## 5 2011-05-10 24891800 69.22
## 6 2011-05-13 30685000 69.65
## 7 2011-05-14 31496700 69.52
## 8 2011-05-15 24884400 69.26
## 9 2011-05-16 18630800 69.35
## 10 2011-05-17 29411900 68.65
head(group_S06_df, 10)
## SeriesInd Var05 Var07
## 1 2011-05-06 27.02 27.32
## 2 2011-05-07 27.27 28.07
## 3 2011-05-08 28.03 28.11
## 4 2011-05-09 28.12 29.13
## 5 2011-05-10 28.90 28.86
## 6 2011-05-13 29.09 28.80
## 7 2011-05-14 28.47 28.08
## 8 2011-05-15 27.99 28.58
## 9 2011-05-16 28.50 28.99
## 10 2011-05-17 28.82 28.08
Let's start looking at the data. First we make line plots of all the groups and variables we will be forecasting.
SeriesInd1 <- as.Date(group_S01_df$SeriesInd,origin = "1899-12-30")
SeriesInd2 <- as.Date(group_S02_df$SeriesInd,origin = "1899-12-30")
SeriesInd3 <- as.Date(group_S03_df$SeriesInd,origin = "1899-12-30")
SeriesInd4 <- as.Date(group_S04_df$SeriesInd,origin = "1899-12-30")
SeriesInd5 <- as.Date(group_S05_df$SeriesInd,origin = "1899-12-30")
SeriesInd6 <- as.Date(group_S06_df$SeriesInd,origin = "1899-12-30")
p1 <- ggplot(group_S01_df, aes(SeriesInd1, y = Var02, color = variable)) + ggtitle("S01") + geom_line(aes(y = Var02, col = "Var02")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(group_S01_df, aes(SeriesInd1, y = Var01, color = variable)) + ggtitle("S01") + geom_line(aes(y = Var01, col = "Var01")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p3 <- ggplot(group_S02_df, aes(SeriesInd2, y = Var02,color = variable))+ ggtitle("S02") + geom_line(aes(y = Var02, col = "Var02")) +xlab("Time")+ theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p4 <- ggplot(group_S02_df, aes(SeriesInd2, y = Var03,color = variable))+ ggtitle("S02") + geom_line(aes(y = Var03, col = "Var03")) +xlab("Time")+ theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p5 <- ggplot(group_S03_df, aes(SeriesInd3, y = Var05, color = variable)) + geom_line(aes(y = Var05, col = "Var05")) + ggtitle("S03")+ xlab("Time") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p6 <- ggplot(group_S03_df, aes(SeriesInd3, y = Var07, color = variable)) + geom_line(aes(y = Var07, col = "Var07")) + ggtitle("S03") + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p7 <- ggplot(group_S04_df, aes(SeriesInd4, y = Var01, color = variable)) + geom_line(aes(y = Var01, col = "Var01")) + ggtitle("S04") +xlab("Time")+ theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p8 <- ggplot(group_S04_df, aes(SeriesInd4, y = Var02,color = variable))+ ggtitle("S04") + geom_line(aes(y = Var02, col = "Var02")) +xlab("Time")+ theme_bw()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
p9 <- ggplot(group_S05_df, aes(SeriesInd5, y = Var03, color = variable)) + ggtitle("S05") +geom_line(aes(y = Var03, col = "Var03")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p10 <- ggplot(group_S05_df, aes(SeriesInd5, y = Var02,color = variable))+ ggtitle("S05") + geom_line(aes(y = Var02, col = "Var02")) + xlab("Time")+theme_bw()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
p11 <- ggplot(group_S06_df, aes(SeriesInd6, y = Var05, color = variable)) + geom_line(aes(y = Var05, col = "Var05")) +
ggtitle("S06") +xlab("Time")+ theme_bw()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
p12 <- ggplot(group_S06_df, aes(SeriesInd6, y = Var07, color = variable)) + ggtitle("S06") + geom_line(aes(y = Var07, col = "Var07")) + xlab("Time")+theme_bw()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(p1, p2, nrow = 1)
grid.arrange(p3, p4, nrow = 1)
grid.arrange(p5, p6, nrow = 1)
grid.arrange(p7,p8, nrow = 1)
grid.arrange(p9,p10, nrow = 1)
grid.arrange(p11,p12, nrow = 1)
Compared to the other variables, Var02 is noisier and contains outliers; these outliers need to be fixed before producing forecasts. In S03 and S06, variables Var05 and Var07 look quite similar. The S02 Var03 plot and the S06 Var05 and Var07 plots also show outlier values that need to be fixed before forecasting. We can also observe some seasonality and trend patterns in Var01, Var03, Var05, and Var07.
Removing outliers. We use the IQR to identify outliers in the data. For values that lie outside the 1.5 * IQR limits, we cap them: observations below the lower limit are replaced with the 5th percentile value, and those above the upper limit with the 95th percentile value.
Remove_Outlier <- function(x){
# Cap values outside the 1.5*IQR fences at the 5th / 95th percentiles.
# (The original repeat loop returned on the first pass, so a single pass
# is equivalent.)
qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
return(x)
}
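A toy check of the capping rule on made-up numbers (the values are illustrative only):

```r
x <- c(1:100, 1000)                       # one extreme high value
qnt  <- quantile(x, probs = c(.25, .75))  # 26 and 76
caps <- quantile(x, probs = c(.05, .95))  # 6 and 96
H <- 1.5 * IQR(x)                         # 1.5 * 50 = 75
x[x > (qnt[2] + H)] <- caps[2]            # 1000 exceeds 76 + 75 = 151, capped at 96
x[x < (qnt[1] - H)] <- caps[1]            # nothing lies below 26 - 75 = -49
max(x)                                    # 100: the extreme value is gone
```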
group_S01_df$Var01=Remove_Outlier(group_S01_df$Var01)
group_S01_df$Var02=Remove_Outlier(group_S01_df$Var02)
group_S02_df$Var03=Remove_Outlier(group_S02_df$Var03)
group_S02_df$Var02=Remove_Outlier(group_S02_df$Var02)
group_S04_df$Var01=Remove_Outlier(group_S04_df$Var01)
group_S04_df$Var02=Remove_Outlier(group_S04_df$Var02)
group_S05_df$Var03=Remove_Outlier(group_S05_df$Var03)
group_S05_df$Var02=Remove_Outlier(group_S05_df$Var02)
group_S06_df$Var05=Remove_Outlier(group_S06_df$Var05)
group_S06_df$Var07=Remove_Outlier(group_S06_df$Var07)
Let's plot the variables again to check that the outliers have been fixed.
p1 <- ggplot(group_S01_df, aes(SeriesInd1, y = Var02, color = variable)) + ggtitle("S01") + geom_line(aes(y = Var02, col = "Var02")) + xlab("Time")+theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(group_S01_df, aes(SeriesInd1, y = Var01, color = variable)) + ggtitle("S01") + geom_line(aes(y = Var01, col = "Var01")) +xlab("Time")+ theme_bw()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
p3 <- ggplot(group_S02_df, aes(SeriesInd2, y = Var02,color = variable))+ ggtitle("S02") + geom_line(aes(y = Var02, col = "Var02")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p4 <- ggplot(group_S02_df, aes(SeriesInd2, y = Var03,color = variable))+ ggtitle("S02") + geom_line(aes(y = Var03, col = "Var03")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p5 <- ggplot(group_S03_df, aes(SeriesInd3, y = Var05, color = variable)) + geom_line(aes(y = Var05, col = "Var05")) +
ggtitle("S03") + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p6 <- ggplot(group_S03_df, aes(SeriesInd3, y = Var07, color = variable)) + geom_line(aes(y = Var07, col = "Var07")) +
ggtitle("S03") + xlab("Time")+theme_bw()+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
p7 <- ggplot(group_S04_df, aes(SeriesInd4, y = Var01, color = variable)) + geom_line(aes(y = Var01, col = "Var01")) +
ggtitle("S04") + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p8 <- ggplot(group_S04_df, aes(SeriesInd4, y = Var02,color = variable))+ ggtitle("S04") + geom_line(aes(y = Var02, col = "Var02")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p9 <- ggplot(group_S05_df, aes(SeriesInd5, y = Var03, color = variable)) + ggtitle("S05") + geom_line(aes(y = Var03, col = "Var03")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p10 <- ggplot(group_S05_df, aes(SeriesInd5, y = Var02,color = variable))+ ggtitle("S05") + geom_line(aes(y = Var02, col = "Var02")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p11 <- ggplot(group_S06_df, aes(SeriesInd6, y = Var05, color = variable)) + geom_line(aes(y = Var05, col = "Var05")) +
ggtitle("S06") + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p12 <- ggplot(group_S06_df, aes(SeriesInd6, y = Var07, color = variable)) + ggtitle("S06") + geom_line(aes(y = Var07, col = "Var07")) + xlab("Time")+theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(p1, p2, nrow = 1)
grid.arrange(p3, p4, nrow = 1)
grid.arrange(p5, p6, nrow = 1)
grid.arrange(p7,p8, nrow = 1)
grid.arrange(p9,p10, nrow = 1)
grid.arrange(p11,p12, nrow = 1)
The plots show that the extreme outliers have been removed.
The plots for Var01, Var03, Var05, and Var07 show a general increase over the period, but no obvious repeating pattern in the fluctuations. In other words, there may be some seasonality, but there is a clear upward trend. The variance also does not look stable; it appears to increase over time. Thus, we may apply differencing together with a logarithm or square-root transformation to the original data to stabilize the variance.
We difference each value over the previous value, and for Var02 we difference the logarithm of the series.
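As a quick numeric illustration with made-up prices, `diff(log(x))` gives the day-over-day log return, which closely tracks the simple percentage change for small moves:

```r
x <- c(100, 102, 101)
diff(log(x))           # log returns: log(102/100), log(101/102)
diff(x) / head(x, -1)  # simple returns: 0.02 and about -0.0098
```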
Let's create plots to compare the original data with the transformed data.
par(mfrow=c(1,2))
p1 <- plot(group_S01_df$Var01, type = "l", main = "original dataS01&Var01")
p2 <- plot(diff(group_S01_df$Var01), type = "l", main = "Diff dataS01&Var01")
p3 <- plot(group_S01_df$Var02, type = "l", main = "original dataS01&Var02")
p4 <- plot(diff(log(group_S01_df$Var02)), type = "l", main = "Diff dataS01&Var02")
p5 <- plot(group_S02_df$Var02, type = "l", main = "original dataS02&Var02")
p6 <- plot(diff(log(group_S02_df$Var02)), type = "l", main = "Diff dataS02&Var02")
p7 <- plot(group_S02_df$Var03, type = "l", main = "original dataS02&Var03")
p8 <- plot(diff(group_S02_df$Var03), type = "l", main = "Diff dataS02&Var03")
p9 <- plot(group_S03_df$Var05, type = "l", main = "original dataS03&Var05")
p10 <- plot(diff(group_S03_df$Var05), type = "l", main = "Diff dataS03&Var05")
p11 <- plot(group_S03_df$Var07, type = "l", main = "original dataS03&Var07")
p12 <- plot(diff(group_S03_df$Var07), type = "l", main = "Diff dataS03&Var07")
p13 <- plot(group_S04_df$Var01, type = "l", main = "original dataS04&Var01")
p14 <- plot(diff(group_S04_df$Var01), type = "l", main = "Diff dataS04&Var01")
p15 <- plot(group_S04_df$Var02, type = "l", main = "original dataS04&Var02")
p16 <- plot(diff(log(group_S04_df$Var02)), type = "l", main = "Diff dataS04&Var02")
p17 <- plot(group_S05_df$Var02, type = "l", main = "original dataS05&Var02")
p18 <- plot(diff(log(group_S05_df$Var02)), type = "l", main = "Diff dataS05&Var02")
p19 <- plot(group_S05_df$Var03, type = "l", main = "original dataS05&Var03")
p20 <- plot(diff(group_S05_df$Var03), type = "l", main = "Diff dataS05&Var03")
p21 <- plot(group_S06_df$Var05, type = "l", main = "original dataS06&Var05")
p22 <- plot(diff(group_S06_df$Var05), type = "l", main = "Diff dataS06&Var05")
p23 <- plot(group_S06_df$Var07, type = "l", main = "original dataS06&Var07")
p24 <- plot(diff(group_S06_df$Var07), type = "l", main = "Diff dataS06&Var07")
Compared to the original series, the differenced data looks much less noisy. The data appears more stationary after applying differencing to Var01, Var02, Var03, Var05, and Var07.
ARIMA (autoregressive integrated moving average) is a commonly used technique for fitting and forecasting time-series data. It is a generalized version of the ARMA (autoregressive moving average) process, where the ARMA model is applied to a differenced version of the data rather than the original.
Three numbers p, d, and q specify an ARIMA model, which is said to be of order (p,d,q). Here p, d, and q are the orders of the AR part, the differencing, and the MA part, respectively.
AR and MA are both techniques for fitting stationary time-series data; ARMA (and ARIMA) combines the two for a better fit.
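A small simulation sketch (synthetic data with an illustrative coefficient, not part of the project data) shows the idea: simulate an ARIMA(1,1,0) process with a known AR coefficient and check that fitting recovers something close to it:

```r
set.seed(123)
# Simulate 500 observations of an ARIMA(1,1,0) process with ar1 = 0.7
x <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.7), n = 500)
fit <- arima(x, order = c(1, 1, 0))
coef(fit)["ar1"]  # estimate should land near the true value 0.7
```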
First let's get the dataframe data into a time series for Var02. The data is a series of trading dates, five days followed by a weekend break, so 52 weeks times 5 days divides it nicely into about 260 observations per year over 7 years of data. We will then fit the data using auto.arima to find appropriate values for p, d, and q (p = the number of lag observations, d = the degree of differencing, and q = the size of the moving-average window for lagged forecast errors).
We see that a 1,1,2 model appears to be most appropriate.
h <- 140
TsVar02 <- ts(group_S05_df$Var02, frequency = 5*52)
#TsVar02 <- ts(group_S01_df[,1:2])
# Re-create with an explicit start/end window; note that if start/end imply
# more points than the series contains, ts() silently recycles the data
TsVar02 <- ts(group_S05_df$Var02, start = c(2011, 85), end = c(2017, 166), frequency = 260)
arimaFitVar02 <- auto.arima(TsVar02, seasonal = FALSE)
arimaFitVar02
## Series: TsVar02
## ARIMA(1,1,2)
##
## Coefficients:
## ar1 ma1 ma2
## 0.7600 -1.3275 0.3646
## s.e. 0.0619 0.0725 0.0587
##
## sigma^2 estimated as 1.545e+13: log likelihood=-27244.99
## AIC=54497.99 AICc=54498.01 BIC=54519.6
We will use the Applied Statistical Time Series Analysis (astsa) package and its sarima wrapper around the arima toolset to do our forecasting.
Looking at the residuals in the package output, they resemble white noise, which is what we hope for, since we don't want to see a pattern. The ACF of the residuals looks acceptable too: there are a couple of small spikes just above the significance threshold, but it is otherwise fine.
Note that the Ljung-Box test results show no residual autocorrelation, as all p-values are above the significance threshold.
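As a standalone sketch of what the Ljung-Box test reports (synthetic series, illustrative only): white noise should give a large p-value, while a strongly autocorrelated series gives a tiny one:

```r
set.seed(7)
wn <- rnorm(300)                                  # white noise
ac <- arima.sim(model = list(ar = 0.9), n = 300)  # strongly autocorrelated
Box.test(wn, lag = 10, type = "Ljung-Box")$p.value  # typically large: no autocorrelation
Box.test(ac, lag = 10, type = "Ljung-Box")$p.value  # tiny: autocorrelation present
```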
sarima(TsVar02, 1, 1, 2)
## initial value 15.326257
## iter 2 value 15.244408
## iter 3 value 15.199934
## iter 4 value 15.197369
## iter 5 value 15.197323
## iter 6 value 15.197246
## iter 7 value 15.196955
## iter 8 value 15.196021
## iter 9 value 15.194363
## iter 10 value 15.192923
## iter 11 value 15.192371
## iter 12 value 15.191646
## iter 13 value 15.191477
## iter 14 value 15.191121
## iter 15 value 15.190463
## iter 16 value 15.189146
## iter 17 value 15.187523
## iter 18 value 15.186481
## iter 19 value 15.186372
## iter 20 value 15.186201
## iter 21 value 15.186107
## iter 22 value 15.185764
## iter 23 value 15.185437
## iter 24 value 15.185324
## iter 25 value 15.185287
## iter 26 value 15.185280
## iter 27 value 15.185279
## iter 27 value 15.185279
## final value 15.185279
## converged
## initial value 15.184058
## iter 2 value 15.184037
## iter 3 value 15.184028
## iter 4 value 15.184001
## iter 5 value 15.183934
## iter 6 value 15.183843
## iter 7 value 15.183756
## iter 8 value 15.183735
## iter 9 value 15.183735
## iter 9 value 15.183735
## iter 9 value 15.183735
## final value 15.183735
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, optim.control = list(trace = trc, REPORT = 1,
## reltol = tol))
##
## Coefficients:
## ar1 ma1 ma2 constant
## 0.7586 -1.3264 0.3638 -1470.274
## s.e. 0.0619 0.0724 0.0587 15191.714
##
## sigma^2 estimated as 1.542e+13: log likelihood = -27244.99, aic = 54499.97
##
## $degrees_of_freedom
## [1] 1637
##
## $ttable
## Estimate SE t.value p.value
## ar1 0.7586 0.0619 12.2483 0.0000
## ma1 -1.3264 0.0724 -18.3114 0.0000
## ma2 0.3638 0.0587 6.2020 0.0000
## constant -1470.2739 15191.7140 -0.0968 0.9229
##
## $AIC
## [1] 31.37181
##
## $AICc
## [1] 31.37305
##
## $BIC
## [1] 30.38498
fSarima <- sarima.for(TsVar02[1:1622], 140, 1, 1, 2, plot.all = TRUE)
fSarima$pred[1:20]
## [1] 10708749 10652147 10606869 10570099 10539723 10514149 10492185 10472931 10455715 10440030 10425494 10411823 10398801 10386266 10374098 10362206 10350520 10338990 10327576 10316251
#save prediction into df
group_S05_df[1623:1762,2] <- fSarima$pred
Doing the same process for Var03, we see that a (2,1,1) model is best. As before, the Ljung-Box test results look good, as does the ACF of the residuals.
library(astsa)
h <- 140
TsVar03 <- ts(group_S05_df$Var03, frequency = 5*52)
arimaFitVar03 <- auto.arima(TsVar03, seasonal = FALSE)
arimaFitVar03
## Series: TsVar03
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.7785 -0.1368 -0.6654
## s.e. 0.1458 0.0251 0.1461
##
## sigma^2 estimated as 0.8007: log likelihood=-2118.51
## AIC=4245.01 AICc=4245.04 BIC=4266.58
sarima(TsVar03, 2, 1, 1)
## initial value -0.102328
## iter 2 value -0.104048
## iter 3 value -0.110049
## iter 4 value -0.110131
## iter 5 value -0.110136
## iter 6 value -0.110149
## iter 7 value -0.110186
## iter 8 value -0.110279
## iter 9 value -0.110542
## iter 10 value -0.110723
## iter 11 value -0.110793
## iter 12 value -0.111166
## iter 13 value -0.111173
## iter 14 value -0.111195
## iter 15 value -0.111257
## iter 16 value -0.111381
## iter 17 value -0.111520
## iter 18 value -0.111771
## iter 19 value -0.112038
## iter 20 value -0.112038
## iter 21 value -0.112051
## iter 22 value -0.112053
## iter 23 value -0.112099
## iter 24 value -0.112133
## iter 25 value -0.112135
## iter 26 value -0.112146
## iter 27 value -0.112149
## iter 28 value -0.112150
## iter 29 value -0.112150
## iter 29 value -0.112150
## iter 29 value -0.112150
## final value -0.112150
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, optim.control = list(trace = trc, REPORT = 1,
## reltol = tol))
##
## Coefficients:
## ar1 ar2 ma1 constant
## 0.7799 -0.1370 -0.6671 0.0131
## s.e. 0.1451 0.0251 0.1454 0.0207
##
## sigma^2 estimated as 0.7991: log likelihood = -2118.3, aic = 4246.61
##
## $degrees_of_freedom
## [1] 1617
##
## $ttable
## Estimate SE t.value p.value
## ar1 0.7799 0.1451 5.3731 0.0000
## ar2 -0.1370 0.0251 -5.4585 0.0000
## ma1 -0.6671 0.1454 -4.5872 0.0000
## constant 0.0131 0.0207 0.6321 0.5274
##
## $AIC
## [1] 0.7802196
##
## $AICc
## [1] 0.7813741
##
## $BIC
## [1] -0.2073532
fitArimaV03 <- sarima(TsVar03, 2, 1, 1, details = FALSE)
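The Box test referred to above can be run explicitly on the residuals. A minimal sketch, assuming the `fitArimaV03` object from the `sarima()` call above (which stores the underlying `stats::arima` fit in `$fit`); the lag choice of 10 is an arbitrary illustration:

```r
# Ljung-Box test on the ARIMA(2,1,1) residuals; a high p-value means
# no significant residual autocorrelation remains
res <- residuals(fitArimaV03$fit)
Box.test(res, lag = 10, type = "Ljung-Box")
```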
fSarima <- sarima.for(TsVar03[1:1622], n.ahead = 140, 2, 1, 1, plot.all = TRUE)
fSarima$pred[1:20]
## [1] 89.69658 89.70958 89.72489 89.73975 89.75394 89.76767 89.78114 89.79445 89.80770 89.82090 89.83408 89.84726 89.86042 89.87359 89.88675 89.89991 89.91308 89.92624 89.93940 89.95256
#save prediction into df
group_S05_df[1623:1762,3] <- fSarima$pred
bcTsVar02 <- BoxCox(TsVar02, lambda = "auto")
fSarima <- sarima.for(bcTsVar02, 140, 1, 1, 3, plot.all = TRUE)
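Forecasts produced from the Box-Cox-transformed series are on the transformed scale; to compare them with the raw data they must be inverted with the same lambda. A minimal sketch, computing lambda explicitly rather than relying on `"auto"` so the same value can be reused for the inversion:

```r
# Back-transform Box-Cox-scale forecasts to the original price scale
lam <- BoxCox.lambda(TsVar02)               # the lambda that "auto" selects
bcTsVar02 <- BoxCox(TsVar02, lambda = lam)
fSarima <- sarima.for(bcTsVar02, 140, 1, 1, 3)
predOriginal <- InvBoxCox(fSarima$pred, lambda = lam)
```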
#Reset values
TsVar02 <- ts(group_S05_df$Var02, frequency = 5*52)
TsVar03 <- ts(group_S05_df$Var03, frequency = 5*52)
library(urca)
summary(ur.kpss(TsVar02))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 10.105
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
summary(ur.kpss(diff(TsVar02)))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 0.0054
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
summary(ur.kpss(TsVar03))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 8.3755
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
summary(ur.kpss(diff(TsVar03)))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 0.0658
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
print("Do we actually need a second order of differencing?")
## [1] "Do we actually need a second order of differencing?"
summary(ur.kpss(diff(diff(TsVar03))))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 0.003
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
print("Checking for seasonality:")
## [1] "Checking for seasonality:"
nsdiffs(TsVar02, test = "seas")
## [1] 0
nsdiffs(TsVar03, test = "seas")
## [1] 0
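The manual KPSS checks above can be summarized with `forecast::ndiffs()`, which applies the same test repeatedly and returns the number of ordinary differences required; a quick cross-check:

```r
# Should agree with the KPSS tests above: one difference is enough
ndiffs(TsVar02, test = "kpss")
ndiffs(TsVar03, test = "kpss")
```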
Checking the ACF and PACF graphs for the p and q values.
Acf(diff(TsVar02))
pacf(diff(TsVar02))
Acf(diff(TsVar03))
pacf(diff(TsVar03))
Looking at the plots, it seems a (1,1,4) model might be better for Var02, considering the number of spikes in the ACF graph. Rerunning with that model, the Box test results look better, although the AIC numbers are virtually unchanged.
Based on the ACF graphs, the (2,1,1) ARIMA model looks reasonable for Var03.
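One way to weigh the candidate orders suggested by the ACF/PACF plots is to fit each with `forecast::Arima()` and compare information criteria. A sketch for Var02, comparing the original (1,1,2) fit against the (1,1,4) alternative:

```r
# Fit both candidate models for Var02 and compare AICc (lower is better)
fit112 <- Arima(TsVar02, order = c(1, 1, 2))
fit114 <- Arima(TsVar02, order = c(1, 1, 4))
c(AICc_112 = fit112$aicc, AICc_114 = fit114$aicc)
```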
fSarima <- sarima.for(TsVar02, 140, 4, 1, 4, plot.all = TRUE)
## Series: TsVar02
## ARIMA(3,1,4) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 drift
## 0.0818 0.8584 -0.1525 -0.5703 -1.0397 0.5069 0.1112 -41486.54
## s.e. 0.2240 0.0611 0.1669 0.2245 0.0889 0.2168 0.0786 16530.25
##
## sigma^2 estimated as 2.745e+14: log likelihood=-29243.06
## AIC=58504.12 AICc=58504.23 BIC=58552.63
## Series: TsVar03
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.7197 0.0683 -0.8283
## s.e. 0.2518 0.0450 0.2499
##
## sigma^2 estimated as 0.09362: log likelihood=-378.95
## AIC=765.9 AICc=765.92 BIC=787.46
## [1] "Subset of Predicted values: "
## [1] 23391972 25880151 25881126 26758001 26442267 27160232 26805375 27432028 27060353 27613212 27234991 27726531 27348929 27788880 27416942 27812975 27450175 27808391 27457041 27782336
## Warning in stats::arima(xdata, order = c(p, d, q), seasonal = list(order = c(P, : possible convergence problem: optim gave code = 1
## [1] "Subset of Predicted values: "
## [1] 12.97521 12.98033 12.98516 12.98920 12.99270 12.99580 12.99861 13.00121 13.00366 13.00599 13.00824 13.01043 13.01257 13.01468 13.01676 13.01883 13.02089 13.02293 13.02497 13.02700
## Series: TsVar05
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## -0.1634
## s.e. 0.0245
##
## sigma^2 estimated as 2.248: log likelihood=-2956.16
## AIC=5916.32 AICc=5916.33 BIC=5927.1
## Series: TsVar07
## ARIMA(0,1,0)
##
## sigma^2 estimated as 1.811: log likelihood=-2781.33
## AIC=5564.65 AICc=5564.66 BIC=5570.04
## [1] "Subset of Predicted values: "
## [1] 98.71268 98.75792 98.79945 98.84159 98.88363 98.92569 98.96774 99.00980 99.05185 99.09391 99.13596 99.17802 99.22008 99.26213 99.30419 99.34624 99.38830 99.43035 99.47241 99.51446
## [1] "Subset of Predicted values: "
## [1] 97.38118 97.42237 97.46356 97.50475 97.54594 97.58713 97.62832 97.66951 97.71070 97.75188 97.79307 97.83426 97.87545 97.91664 97.95783 97.99902 98.04021 98.08140 98.12258 98.16377
## Series: TsVar01
## ARIMA(0,1,0)
##
## sigma^2 estimated as 0.2557: log likelihood=-1194.84
## AIC=2391.68 AICc=2391.68 BIC=2397.07
## Series: TsVar02
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.4778 0.0138 -0.9523
## s.e. 0.0282 0.0274 0.0134
##
## sigma^2 estimated as 5.723e+13: log likelihood=-27974.37
## AIC=55956.73 AICc=55956.76 BIC=55978.3
## initial value -0.682128
## iter 1 value -0.682128
## final value -0.682128
## converged
## initial value -0.682128
## iter 1 value -0.682128
## final value -0.682128
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, optim.control = list(trace = trc, REPORT = 1,
## reltol = tol))
##
## Coefficients:
## constant
## 0.0122
## s.e. 0.0126
##
## sigma^2 estimated as 0.2556: log likelihood = -1194.37, aic = 2392.74
##
## $degrees_of_freedom
## [1] 1620
##
## $ttable
## Estimate SE t.value p.value
## constant 0.0122 0.0126 0.9684 0.333
##
## $AIC
## [1] -0.3630232
##
## $AICc
## [1] -0.3617856
##
## $BIC
## [1] -1.359699
## [1] "Subset of Predicted values: "
## [1] 36.92216 36.93432 36.94648 36.95864 36.97080 36.98295 36.99511 37.00727 37.01943 37.03159 37.04375 37.05591 37.06807 37.08023 37.09239 37.10455 37.11671 37.12886 37.14102 37.15318
## [1] "Subset of Predicted values: "
## [1] 10000727 11596210 12404977 12811308 13014265 13114379 13162487 13184297 13192810 13194598 13192986 13189654 13185453 13180812 13175949 13170973 13165941 13160879 13155804 13150720
## Series: TsVar05
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.4661
## s.e. 0.0230
##
## sigma^2 estimated as 0.844: log likelihood=-2162.24
## AIC=4328.48 AICc=4328.48 BIC=4339.26
## Series: TsVar07
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.4553
## s.e. 0.0232
##
## sigma^2 estimated as 0.8512: log likelihood=-2169.12
## AIC=4342.25 AICc=4342.25 BIC=4353.03
## initial value 0.008912
## iter 2 value -0.080612
## iter 3 value -0.084606
## iter 4 value -0.085420
## iter 5 value -0.085437
## iter 6 value -0.085437
## iter 7 value -0.085437
## iter 8 value -0.085437
## iter 8 value -0.085437
## iter 8 value -0.085437
## final value -0.085437
## converged
## initial value -0.085407
## iter 2 value -0.085407
## iter 3 value -0.085407
## iter 3 value -0.085407
## iter 3 value -0.085407
## final value -0.085407
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, optim.control = list(trace = trc, REPORT = 1,
## reltol = tol))
##
## Coefficients:
## ma1 constant
## -0.4673 0.0131
## s.e. 0.0230 0.0122
##
## sigma^2 estimated as 0.8429: log likelihood = -2161.65, aic = 4329.31
##
## $degrees_of_freedom
## [1] 1619
##
## $ttable
## Estimate SE t.value p.value
## ma1 -0.4673 0.0230 -20.3260 0.0000
## constant 0.0131 0.0122 1.0813 0.2797
##
## $AIC
## [1] 0.8315002
##
## $AICc
## [1] 0.8327424
##
## $BIC
## [1] -0.1618519
## [1] "Subset of Predicted values: "
## [1] 48.65134 48.66448 48.67763 48.69077 48.70391 48.71705 48.73020 48.74334 48.75648 48.76962 48.78277 48.79591 48.80905 48.82219 48.83534 48.84848 48.86162 48.87476 48.88791 48.90105
## [1] "Subset of Predicted values: "
## [1] 48.42725 48.44001 48.45276 48.46552 48.47827 48.49103 48.50378 48.51653 48.52929 48.54204 48.55480 48.56755 48.58031 48.59306 48.60582 48.61857 48.63133 48.64408 48.65683 48.66959
A promising and more complicated technique involves a process called deep learning. Deep learning can deliver very accurate results at the cost of computer processing power: building and training a model can take hours or days compared to other ML techniques. A deep learning model popular for time series is the Long Short-Term Memory (LSTM) network. Like ARIMA, this process requires data transformation to give the model the best chance at accurate forecasts, and it involves constant tuning of model parameters such as batch size, number of iterations, the loss function, the optimizer and other options. After each change, the model must be retrained.
Our attempt didn’t provide desirable results. The errors we calculated, such as the RMSE, were high, so the model couldn’t be trusted to provide a good fit to the data. To simplify the model, the daily high and low prices were averaged so the model only had to forecast one variable.
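The simplification described above can be sketched in a few lines of base R; the `high`, `low` and `pred` vectors here are placeholder values for illustration, not the dataset's actual columns:

```r
# Average the daily high and low into one mid-price target, then
# score a forecast against it with RMSE (placeholder vectors)
high <- c(102, 104, 101)
low  <- c( 98, 100,  97)
pred <- c(100, 101,  99)
midPrice <- (high + low) / 2        # single series to forecast
rmse <- sqrt(mean((midPrice - pred)^2))
rmse                                # sqrt(1/3), about 0.577
```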
Each variable in each series of the dataset presents its own challenges. There isn’t a one-size-fits-all solution for every time-series dataset, though ARIMA may come close. It can reduce the residuals to white noise, free of autocorrelation, and it can account for long-term trends and seasonality. Through automated selection it can vary its p, d, and q parameters to build the best model.
Of course, many forecasting models will never be perfect. But the process of improving them may lead to the discovery of characteristics in the data that you would not have seen otherwise. Turning the random into the predictable is exhausting work, but we have found the ARIMA model to be a proven starting point for any forecasting project.